
    External uniform electric field removing flexoelectric effect in epitaxial ferroelectric thin films

    Using the modified Landau-Ginsburg-Devonshire thermodynamic theory, it is found that the coupling between stress gradient and polarization, known as flexoelectricity, has a significant effect on the ferroelectric properties of epitaxial thin films, such as the polarization, free-energy profile and hysteresis loop. However, this effect can be completely eliminated by applying an optimized external, uniform electric field. The role of such a uniform electric field is shown to be the same as that of an ideal gradient electric field, which can suppress the flexoelectric effect completely according to the present theory. Since a uniform electric field is more convenient to apply and control than a gradient electric field, it can potentially be used to remove the flexoelectric effect induced by the stress gradient in epitaxial thin films and to enhance the ferroelectric properties. Comment: 5 pages, 3 figures
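
    As a schematic illustration only (not the paper's exact free energy, and with placeholder coefficients a, b, c, flexoelectric coefficient f and stress profile σ(z)), the flexoelectric coupling enters a Landau-Ginsburg-Devonshire expansion as a term linear in the polarization, so it acts like an effective field that a suitably chosen uniform external field can offset:

```latex
% Schematic LGD free-energy density with a flexoelectric (stress-gradient) term.
% All symbols are placeholders; the paper's full expansion also contains strain
% and gradient-energy terms that are omitted here.
\begin{align}
  F(P) &= a P^{2} + b P^{4} + c P^{6}
          - \left( E_{\mathrm{ext}} + f \frac{\mathrm{d}\sigma}{\mathrm{d}z} \right) P , \\
  E_{\mathrm{ext}}^{*} &= - f \left\langle \frac{\mathrm{d}\sigma}{\mathrm{d}z} \right\rangle
  \quad \text{(thickness-averaged compensating field)} .
\end{align}
```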

    VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

    The end-to-end singing voice synthesis (SVS) model VISinger achieves better performance than typical two-stage models with fewer parameters. However, VISinger has several problems: the text-to-phase problem, where the end-to-end model learns a meaningless text-to-phase mapping; the glitch problem, where the harmonic components of the periodic signal in voiced segments undergo sudden changes that produce audible artefacts; and the low sampling rate, since 24 kHz does not meet the needs of high-fidelity, full-band generation (44.1 kHz or higher). In this paper, we propose VISinger 2, which addresses these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP), we incorporate a DSP synthesizer into the decoder. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer that generate the periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract a latent representation free of phase information and prevents the prior encoder from modelling the text-to-phase mapping. To avoid glitch artefacts, HiFi-GAN is modified to accept the waveform generated by the DSP synthesizer as a condition for producing the singing voice. Moreover, with the improved waveform decoder, VISinger 2 generates 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus show that VISinger 2 outperforms VISinger, CpopSing and RefineSinger in both subjective and objective metrics. Comment: Submitted to ICASSP 202
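
    A minimal sketch of the harmonic-plus-noise idea behind such a DSP synthesizer is given below; the frame rates, tensor shapes, and the function name harmonic_plus_noise are illustrative assumptions, not the VISinger 2 implementation (which further conditions a modified HiFi-GAN on the resulting waveform).

```python
# Minimal sketch of a DDSP-style harmonic-plus-noise synthesizer, assuming frame-level
# f0 and per-harmonic amplitudes have already been predicted from the latent z.
import numpy as np

def harmonic_plus_noise(f0, harm_amps, noise_mag, sr=44100, hop=512):
    """f0: (T,) Hz per frame; harm_amps: (T, K) harmonic amplitudes;
    noise_mag: (T, F) noise filter magnitudes per frame."""
    T, K = harm_amps.shape
    n = T * hop
    # Upsample frame-level controls to sample rate.
    t_frames = np.arange(T) * hop
    t_samples = np.arange(n)
    f0_s = np.interp(t_samples, t_frames, f0)
    phase = 2 * np.pi * np.cumsum(f0_s / sr)            # instantaneous phase
    harmonic = np.zeros(n)
    for k in range(1, K + 1):
        amp_k = np.interp(t_samples, t_frames, harm_amps[:, k - 1])
        # Zero out harmonics above Nyquist to avoid aliasing.
        amp_k = np.where(k * f0_s < sr / 2, amp_k, 0.0)
        harmonic += amp_k * np.sin(k * phase)
    # Aperiodic part: white noise, crudely shaped by the time-averaged filter response.
    noise = np.random.randn(n)
    H = np.interp(np.linspace(0, 1, n // 2 + 1),
                  np.linspace(0, 1, noise_mag.shape[1]), noise_mag.mean(axis=0))
    noise = np.fft.irfft(np.fft.rfft(noise) * H, n)
    return harmonic + noise
```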

    Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

    This paper aims to synthesize a target speaker's speech with a desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. Specifically, we address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt a multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labelled data, style-labelled data, and unlabelled data. To better transfer fine-grained expressiveness from the references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the effectiveness of the proposed design. Comment: Submitted to ICASSP 202
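
    The multi-label binary vector can be pictured as a sigmoid bottleneck whose outputs are hard-thresholded into a binary code, with a straight-through estimator keeping it trainable. The sketch below only illustrates that general mechanism; the class name and dimensions are assumptions, not the paper's code.

```python
# Hedged sketch of a multi-label binary vector (MBV) bottleneck: an embedding is squashed
# to [0, 1] and hard-thresholded to a binary code, with a straight-through estimator so
# gradients still flow to the projection.
import torch
import torch.nn as nn

class MultiLabelBinaryVector(nn.Module):
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.proj(x))        # soft memberships in [0, 1]
        hard = (soft > 0.5).float()               # discrete multi-label code
        # Straight-through: forward pass uses the hard code, backward uses the soft gradient.
        return hard + soft - soft.detach()
```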

    PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

    Style transfer TTS has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practical situations, users may want to transfer a style by typing a text description of the desired style, without reference speech in the target style. Text-guided content generation techniques have drawn wide attention recently. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text-prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representations through a two-stage training process. Experiments show that PromptStyle achieves proper style transfer with text prompts while maintaining relatively high stability and speaker similarity. Audio samples are available on our demo page.
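
    One plausible way to build such a shared space is to contrastively align paired text-prompt and reference-speech style embeddings; the InfoNCE-style objective below is an assumption for illustration, and PromptStyle's actual two-stage training may differ.

```python
# Illustrative sketch of aligning a text-prompt embedding with a speech style embedding
# in a shared space. The contrastive objective and shapes are assumptions, not the
# paper's actual training loss.
import torch
import torch.nn.functional as F

def alignment_loss(prompt_emb: torch.Tensor, style_emb: torch.Tensor, tau: float = 0.07):
    """prompt_emb, style_emb: (B, D) embeddings of paired text prompts and reference audio."""
    p = F.normalize(prompt_emb, dim=-1)
    s = F.normalize(style_emb, dim=-1)
    logits = p @ s.t() / tau                         # (B, B) pairwise similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Pull matched prompt/style pairs together, push mismatched pairs apart (both directions).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```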

    AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

    Speaker adaptation in text-to-speech synthesis (TTS) finetunes a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low-computational-resource scenarios, which are challenging because they require a lightweight model with low computational complexity. In this paper, a tiny VITS-based TTS model for low-computing-resource speaker adaptation, named AdaVITS, is proposed. To effectively reduce the parameters and computational complexity of VITS, an iSTFT-based waveform construction decoder is proposed to replace the upsampling-based decoder, which is resource-consuming in the original VITS. Besides, NanoFlow is introduced to share the density estimate across flow blocks to reduce the parameters of the prior encoder. Furthermore, to reduce the computational complexity of the textual encoder, scaled dot-product attention is replaced with linear attention. To deal with the instability caused by the simplified model, instead of using the original text encoder, a phonetic posteriorgram (PPG) is utilized as the linguistic feature via a text-to-PPG module and is then used as the input of the encoder. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs of computational complexity. Comment: Accepted by ISCSLP 202
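
    Linear attention, adopted here in place of scaled dot-product attention, can be sketched as follows; the elu+1 feature map and tensor shapes are illustrative choices rather than AdaVITS's exact configuration.

```python
# Minimal sketch of linear attention (kernel feature map phi = elu + 1), which avoids
# forming the quadratic T x T attention matrix of scaled dot-product attention.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (B, T, D). Cost is O(T * D^2) instead of O(T^2 * D)."""
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum('btd,bte->bde', k, v)                    # summarize keys/values once
    z = 1.0 / (torch.einsum('btd,bd->bt', q, k.sum(dim=1)) + eps)
    return torch.einsum('btd,bde,bt->bte', q, kv, z)
```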

    DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

    Recent development of neural vocoders based on generative adversarial networks (GANs) has shown their advantage in generating raw waveforms conditioned on mel-spectrograms, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech in various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency-domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms used in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the text-to-speech (TTS) acoustic model, as the time-frequency-domain supervision for the GAN-based vocoder. We also utilize sine excitation as time-domain supervision to improve harmonic modelling and eliminate various artefacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech based on diverse data in TTS. Comment: Submitted to ICASSP 202
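
    A sine excitation of the kind used for time-domain supervision can be sketched from frame-level f0 as below; the amplitudes, noise levels, and function name are illustrative assumptions, and DSPGAN's exact excitation module may differ.

```python
# Hedged sketch of a sine excitation built from frame-level f0, in the spirit of
# neural source-filter models: a sinusoid in voiced regions plus low-level noise.
import numpy as np

def sine_excitation(f0, sr=24000, hop=256, amp=0.1, noise_std=0.003):
    """f0: (T,) Hz per frame, 0 for unvoiced frames. Returns a (T*hop,) waveform."""
    n = len(f0) * hop
    f0_s = np.repeat(f0, hop).astype(np.float64)
    phase = 2 * np.pi * np.cumsum(f0_s / sr)
    voiced = (f0_s > 0).astype(np.float64)
    # Sine wave in voiced regions; Gaussian noise everywhere, stronger when unvoiced.
    return voiced * amp * np.sin(phase) + noise_std * (1 + 2 * (1 - voiced)) * np.random.randn(n)
```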

    Energy-Saving and Low-Carbon Gear Blank Dimension Design Based on Business Compass

    Sustainable blank dimension design is key to the implementation of green industrial development. However, conventional blank dimension design considers only the production factors of the blank design stage, so it cannot guarantee the overall goals of the blank production stage and the use stage. In this paper, guided by the business compass concept, a low-carbon, low-energy-consumption blank dimension optimization design model was proposed. Taking the process parameters of blank production and use as the variables, the grey wolf optimization algorithm was adopted to solve the model. Taking a gear blank dimension as an example, the optimized blank dimension is 98.6; compared with the standard blank dimensions of 100 and 105, energy consumption is 95.7% and 93.1%, carbon emissions are 92.6% and 90.2%, and material consumption is 96.5% and 87.5%, respectively. The sustainable blank dimension design has obvious advantages in terms of low energy consumption and low carbon emissions, saves a considerable amount of material, and can also promote product sustainability.
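
    The grey wolf optimization algorithm used here can be sketched in a few lines for a single design variable such as the blank dimension; the toy objective in the usage comment is a placeholder, not the paper's energy/carbon model.

```python
# Minimal grey wolf optimizer (GWO) sketch for a single bounded design variable.
import numpy as np

def gwo_minimize(objective, lb, ub, n_wolves=20, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(lb, ub, size=n_wolves)
    for it in range(n_iters):
        fitness = np.array([objective(w) for w in wolves])
        alpha, beta, delta = wolves[np.argsort(fitness)[:3]]   # three best wolves lead
        a = 2.0 * (1.0 - it / n_iters)                         # decreases linearly from 2 to 0
        for i in range(n_wolves):
            x = 0.0
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(), rng.random()
                A, C = 2 * a * r1 - a, 2 * r2
                x += leader - A * abs(C * leader - wolves[i])
            wolves[i] = np.clip(x / 3.0, lb, ub)               # average of the three pulls
    return wolves[np.argmin([objective(w) for w in wolves])]

# Usage with a toy placeholder cost over the blank dimension (not the paper's model):
# best_dim = gwo_minimize(lambda d: 0.6 * (d - 98.0) ** 2 + 0.4 * (d - 99.5) ** 2, 95.0, 105.0)
```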